Shu Xu (shuxu3@illinois.edu): Parts 1-3, Generate Data, KNN, CvKNN
Yan Han (yanhan4@illinois.edu): Part 4, Bayes Rule
Amrit Kumar (amritk2@illinois.edu): Part 5, Simulation Study
We finished this notebook together.
Scenario 2: the two-dimensional data $X \in R^2$ in each class are generated from a mixture of 10 different bivariate Gaussian distributions with uncorrelated components and different means, i.e., $$ X | Y = k, Z = j \sim N(m_{kj}, s^2I_2) $$, where $k=0$ or $1$, and $j = 1, 2, ..., 10$. Set $$ P(Y = k) = 1/2, P(Z = j) = 1/10, s^2 = 1/5 $$. In other words, given $Y = k, X$ follows a mixture distribution with probability density function (PDF), $$ \frac{1}{10} \sum_{j=1}^{10} (\frac{1}{\sqrt{2 \pi s^2}})^2 e^{-\frac{\parallel x - m_{kj}\parallel^2}{2 s^2}} $$
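As a quick numerical sanity check, the mixture PDF above can be evaluated directly; a minimal sketch (the centers here are hypothetical placeholders, not the 20 centers generated below):

```python
import numpy as np

def mixture_pdf(x, centers, s2=1/5):
    # (1/10) * sum_j (1/(2*pi*s2)) * exp(-||x - m_kj||^2 / (2*s2))
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distance to each center
    return np.mean(np.exp(-d2 / (2 * s2)) / (2 * np.pi * s2))

# With all 10 centers placed at x itself, the mixture collapses to the
# bivariate normal peak value 1/(2*pi*s2)
centers = np.zeros((10, 2))
p = mixture_pdf(np.array([0.0, 0.0]), centers)
print(np.isclose(p, 1 / (2 * np.pi * (1/5))))  # True
```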
First generate the 20 centers from two-dimensional normal. You can use any mean and covariance structure. You should not regenerate the centers. Use these 20 centers throughout this simulation study.
Given the 20 centers, generate a training sample of size 200 (100 from each class) and a test sample of size 10,000 (5,000 from each class).
Produce a scatter plot of the training data:
assign different colors to the two classes of data points; overlay the 20 centers on this scatter plot, using a distinguishing marker (e.g., a star or a different shape) and color them according to their respective class.
import numpy as np
import plotly.graph_objects as go
import scipy
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
J = 10
K = 2
m0 = np.array([0,1])
m1 = np.array([1,0])
# Step 1: Generate 20 centers
np.random.seed(722)
mu_k0 = np.random.normal(size=(J,K)) + np.tile(m0, (J, 1))
mu_k1 = np.random.normal(size=(J,K)) + np.tile(m1, (J, 1))
# Step 2: Given the 20 centers, generate a training sample of size 200 (100 from each class) and a test sample of size 10,000 (5,000 from each class).
s = np.sqrt(1/5)
n_training = 100
training_set_id0 = np.random.randint(0, J, size=n_training)
training_set_id1 = np.random.randint(0, J, size=n_training)
X_training = np.random.normal(size=(2 * n_training, K)) * s + np.vstack((mu_k0[training_set_id0], mu_k1[training_set_id1]))
Y_training = np.array([0]*n_training + [1]*n_training)
n_testing = 5000
testing_set_id0 = np.random.randint(0, J, size=n_testing)
testing_set_id1 = np.random.randint(0, J, size=n_testing)
X_testing = np.random.normal(size=(2 * n_testing, K)) * s + np.vstack((mu_k0[testing_set_id0], mu_k1[testing_set_id1]))
Y_testing = np.array([0]*n_testing + [1]*n_testing)
# Step 3: Produce a scatter plot of training data
fig = go.Figure()
fig.update_layout(width=1000,height=500)
fig.add_trace(go.Scatter(x=X_training[:n_training, 0], y= X_training[:n_training, 1],
mode='markers',
marker_symbol='circle-open',
name='class 0',
marker_color="blue"))
fig.add_trace(go.Scatter(x=X_training[n_training:, 0], y= X_training[n_training:, 1],
mode='markers',
marker_symbol='circle-open',
name='class 1',
marker_color="red"))
fig.add_trace(go.Scatter(x=mu_k0[:,0], y=mu_k0[:,1],
mode='markers',
name= 'class 0 center',
marker_symbol='cross',
marker_size =15,
marker_color="blue"))
fig.add_trace(go.Scatter(x=mu_k1[:, 0], y=mu_k1[:, 1],
                         mode='markers',
                         name='class 1 center',
                         marker_symbol='star',
                         marker_size=15,
                         marker_color="red"))
fig.show()
Input: Your kNN function should accept three input parameters: training data, test data, and k. No need to write your kNN function to handle any general input; it suffices to write a function that is able to handle the data for this specific simulation study: binary classification; features are two-dimensional numerical vectors.
Output: Your function should return a vector of predictions for the test data.
Vectorization: Efficiently compute distances between all test points and training points simultaneously. Make predictions for all test points in a single operation.
No Loops: Do not use explicit loops like for or while inside your kNN function to compute distances or make predictions. Instead, harness the power of vectorized operations for efficient computations. For example, you can use broadcasting in Numpy or command outer in R.
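As a toy illustration of the broadcasting trick (the arrays here are made-up examples): `A[:, None]` has shape (2, 1, 2), so subtracting the (3, 2) array `B` broadcasts to (2, 3, 2), and taking the norm over the last axis yields the full (2, 3) distance matrix with no loops.

```python
import numpy as np

A = np.array([[0.0, 0.0],
              [3.0, 4.0]])                     # 2 "test" points
B = np.array([[0.0, 0.0],
              [0.0, 4.0],
              [3.0, 0.0]])                     # 3 "training" points

dist = np.linalg.norm(A[:, None] - B, axis=2)  # shape (2, 3)
print(dist)
# [[0. 4. 3.]
#  [5. 3. 4.]]
```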
def kNN_pedict(X_training, Y_training, X_testing, k):
    # Vectorized code
    # X_testing: n_testing x 2
    # X_training: n_training x 2
    # euclidean_dist_mat: n_testing x n_training
    euclidean_dist_mat = np.linalg.norm(X_testing[:, None] - X_training, axis=2)
    # indices of the k nearest training points for each test point: n_testing x k
    k_nearest_neighbor_ids_mat = euclidean_dist_mat.argsort(kind='mergesort', axis=1)[:, :k]  # sort distances in ascending order
    # Prediction: since this is a classification problem, take the mode,
    # i.e., the class label that occurs most often among the k neighbors
    predict_class = scipy.stats.mode(Y_training[k_nearest_neighbor_ids_mat], axis=1, keepdims=True).mode.squeeze()
    return predict_class
Answer
distance ties: I use mergesort in the argsort() call, which is a stable sort and preserves the relative order of equal values. For example, if training observations #1 and #3 are equidistant from a test observation, argsort(kind='mergesort') preserves their relative order, i.e., #1 is chosen before #3 when k = 1.
voting ties: when multiple values occur with the same highest frequency (i.e., there are multiple modes), scipy.stats.mode() returns the smallest of them. In this binary classification simulation, if classes 0 and 1 receive the same number of votes, class 0 is returned.
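Both tie-breaking behaviors can be verified on toy data (made-up arrays, just for illustration):

```python
import numpy as np
from scipy import stats

# Distance ties: mergesort is stable, so among equal distances the
# earlier training index comes first
d = np.array([2.0, 1.0, 2.0, 1.0])        # indices 1 and 3 are tied
order = d.argsort(kind='mergesort')
print(order)                               # [1 3 0 2]

# Voting ties: with equally frequent classes, scipy.stats.mode returns the smallest
votes = np.array([0, 1, 1, 0])
print(stats.mode(votes, keepdims=True).mode[0])  # 0
```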
def calc_accuracy(y_predict,y_measurement):
assert(len(y_measurement) == len(y_predict))
accuracy = np.sum(y_predict==y_measurement)/len(y_predict)
return accuracy
def calc_confusion_matrix(y_predict, y_measurement):
TP = sum((y_predict==1) & (y_measurement==1))
TN = sum((y_predict==0) & (y_measurement==0))
FP = sum((y_predict==1) & (y_measurement==0))
FN = sum((y_predict==0) & (y_measurement==1))
return np.array([[TP, FP],
[FN, TN]])
k = 1
sklearn_knn_1 = KNeighborsClassifier(n_neighbors=k)
sklearn_knn_1.fit(X_training, Y_training)
sklearn_knn_result_1 = sklearn_knn_1.predict(X_testing)
local_knn_result_1 = kNN_pedict(X_training, Y_training, X_testing, k)
print("My Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(local_knn_result_1, Y_testing))
print("skearn.neighbors Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(sklearn_knn_result_1, Y_testing))
print("sklearn.neighbors accuracy is {}".format(sklearn_knn_1.score(X_testing, Y_testing)))
My Confusion Matrix at k = 1
[[3807  952]
 [1193 4048]]
sklearn.neighbors Confusion Matrix at k = 1
[[3807  952]
 [1193 4048]]
sklearn.neighbors accuracy is 0.7855
k = 3
sklearn_knn_3 = KNeighborsClassifier(n_neighbors=k)
sklearn_knn_3.fit(X_training, Y_training)
sklearn_knn_result_3 = sklearn_knn_3.predict(X_testing)
local_knn_result_3 = kNN_pedict(X_training, Y_training, X_testing, k)
print("My Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(local_knn_result_3, Y_testing))
print("skearn.neighbors Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(sklearn_knn_result_3, Y_testing))
print("sklearn.neighbors accuracy is {}".format(sklearn_knn_3.score(X_testing, Y_testing)))
My Confusion Matrix at k = 3
[[3867  853]
 [1133 4147]]
sklearn.neighbors Confusion Matrix at k = 3
[[3867  853]
 [1133 4147]]
sklearn.neighbors accuracy is 0.8014
k = 5
sklearn_knn_5 = KNeighborsClassifier(n_neighbors=k)
sklearn_knn_5.fit(X_training, Y_training)
sklearn_knn_result_5 = sklearn_knn_5.predict(X_testing)
local_knn_result_5 = kNN_pedict(X_training, Y_training, X_testing, k)
print("My Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(local_knn_result_5, Y_testing))
print("skearn.neighbors Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(sklearn_knn_result_5, Y_testing))
print("sklearn.neighbors accuracy is {}".format(sklearn_knn_5.score(X_testing, Y_testing)))
My Confusion Matrix at k = 5
[[3902  751]
 [1098 4249]]
sklearn.neighbors Confusion Matrix at k = 5
[[3902  751]
 [1098 4249]]
sklearn.neighbors accuracy is 0.8151
from sklearn.model_selection import train_test_split
Question: Explain why the maximum candidate K value is 180.
Answer: In 10-fold cross-validation, 10% of the data is held out for validation in each fold, so only 90% of the 200 training observations, i.e., 180, are available for fitting. Since K cannot exceed the number of training samples, the maximum candidate K value is 180.
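A quick check of the split sizes (toy arrays, same 10% hold-out fraction as in the CV below):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((200, 2))
y = np.array([0, 1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/10, random_state=0)
print(len(X_tr), len(X_te))  # 180 20
```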
def cvKNN(x_training, y_training):
    from sklearn.model_selection import KFold
    N_fold = 10
    K_array = np.arange(1, 181)
    cv_error_array = np.zeros(len(K_array))
    # Build the 10 disjoint folds once, so every candidate k is scored on the same
    # partition (repeated train_test_split calls would give overlapping folds,
    # which is not true 10-fold CV)
    folds = list(KFold(n_splits=N_fold, shuffle=True, random_state=722).split(x_training))
    for i, k in enumerate(K_array):
        sklearn_knn_k = KNeighborsClassifier(n_neighbors=k)
        for train_ids, val_ids in folds:
            sklearn_knn_k.fit(x_training[train_ids], y_training[train_ids])
            cv_error_array[i] += (1.0 - sklearn_knn_k.score(x_training[val_ids], y_training[val_ids])) / N_fold
    k_min_index = np.argmin(cv_error_array)
    k_min, cv_error_min = K_array[k_min_index], cv_error_array[k_min_index]
print("The minimum cv error is: {}, with k = {}".format(cv_error_min, k_min))
fig = go.Figure()
fig.update_layout(width=1000,height=500,
xaxis_title="k values", yaxis_title="Average cv error",
)
fig.add_trace(go.Scatter(x=K_array, y=cv_error_array,
mode='lines+markers',
name='average cv error',
marker_color="blue"))
fig.add_trace(go.Scatter(x=[k_min], y=[cv_error_min],
mode='markers',
name= 'k value with the minimum average cv error',
marker_symbol='star',
marker_size =15,
marker_color="red"))
fig.show()
return k_min
kcv_min = cvKNN(X_training, Y_training)
The minimum cv error is: 0.13, with k = 33
sklearn_knn_33 = KNeighborsClassifier(n_neighbors=33)
sklearn_knn_33.fit(X_training, Y_training)  # fit on the training data, not the test data
sklearn_knn_result_33 = sklearn_knn_33.predict(X_testing)
print("skearn.neighbors Confusion Matrix at k = 33")
print(calc_confusion_matrix(sklearn_knn_result_33, Y_testing))
sklearn.neighbors Confusion Matrix at k = 33
[[3930  547]
 [1070 4453]]
The Bayes rule for binary classification (under the zero-one loss), as derived in class, is: predict $Y$ to be 1, if
$$ P(Y = 1 \mid X = x) \ge P(Y = 0 \mid X=x), $$or equivalently
$$ \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X=x)} \ge 1.$$Following the data generation process, we have $$ \displaystyle \frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)}=\frac{P(Y=1) \cdot P(X=x\mid Y=1)}{P(Y=0) \cdot P(X=x\mid Y=0)} $$ $$\displaystyle =\frac{(1/2)\cdot 10^{-1}\sum_{l=1}^{10}(2\pi s^2)^{-1}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{1l}\rVert^2/(2s^2)\right)}{(1/2)\cdot 10^{-1}\sum_{l=1}^{10}(2\pi s^2)^{-1}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{0l}\rVert^2/(2s^2)\right)} $$ $$\displaystyle =\frac{\sum_{l=1}^{10}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{1l}\rVert^2/(2s^2)\right)}{\sum_{l=1}^{10}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{0l}\rVert^2/(2s^2)\right)}. $$
def calculate_prob(data, mu):
result = np.linalg.norm(data[:, None] - mu, axis=2) ** 2
result = np.sum(np.exp(-result / (2 * s ** 2)), axis=1)
return result
def bayes_rule(data, mu0, mu1):
return np.where(calculate_prob(data, mu1) >= calculate_prob(data, mu0), 1, 0)
y_pred_bayes = bayes_rule(X_testing, mu_k0, mu_k1)
print(calc_confusion_matrix(y_pred_bayes, Y_testing))
[[3932  577]
 [1068 4423]]
Given the 20 centers generated in Part 1, repeatedly generate 50 training/test datasets (training size = 200 and test size = 10,000). For each pair of training/test datasets, calculate the test errors (the averaged 0/1 loss on the test data set) for each of the following three procedures:
kNN with K = 7 (you can use the built-in kNN function from R or Python); kNN with K chosen by 10-fold CV (your implementation from Part 3); and the Bayes rule (your implementation from Part 4). Present the test errors graphically, e.g., using a boxplot or strip chart (see below). Also, report the (min, max, median, 25% quantile, 75% quantile) for the 50 selected K values.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import plotly.express as px
import pandas as pd
def generate_simulation_dataset(class_1_centers,class_2_centers,training_sample_size,testing_sample_size):
"""
class_1_centers: Class 1 centers
class_2_centers: Class 2 centers
training_sample_size : combined training sample size
testing_sample_size : combined testing sample size
"""
n_training = int(training_sample_size/2) #training_sample_per_class_size
n_testing = int(testing_sample_size/2) # testing_sample_per_class_size
mu_k0 = class_1_centers
mu_k1 = class_2_centers
#print(f"n_training: {n_training}, n_testing: {n_testing}, mu_k0_length ={len(mu_k0)}, mu_k1_length ={len(mu_k1)} ")
# Generate Training Set
training_set_id0 = np.random.randint(0, J, size=n_training)
training_set_id1 = np.random.randint(0, J, size=n_training)
X_training = np.random.normal(size=(2 * n_training, K)) * s + np.vstack((mu_k0[training_set_id0], mu_k1[training_set_id1]))
Y_training = np.array([0]*n_training + [1]*n_training)
# Generate Testing Set
testing_set_id0 = np.random.randint(0, J, size=n_testing)
testing_set_id1 = np.random.randint(0, J, size=n_testing)
X_testing = np.random.normal(size=(2 * n_testing, K)) * s + np.vstack((mu_k0[testing_set_id0], mu_k1[testing_set_id1]))
Y_testing = np.array([0]*n_testing + [1]* n_testing)
return X_training, Y_training, X_testing, Y_testing
def generate_knn_errors(k,X_training, Y_training, X_testing, Y_testing):
"""Generates errors for k7
k = n_neighbors
Performs following tasks:
========================
1> Train a KNN classifier with K=7 using the training data.
2> Predict labels for the test data.
3> Calculate the test error (0/1 loss) for this KNN model on the test data.
"""
knn_model = KNeighborsClassifier(n_neighbors=k)
knn_model.fit(X_training, Y_training) # Train KNN
y_pred = knn_model.predict(X_testing) # Predict
test_error = zero_one_loss(Y_testing, y_pred) # Calculate test error
return test_error
def generate_knn_cross_validation_errors(k, n_folds,X_training, Y_training, X_testing, Y_testing):
"""
Implement a n-fold cross-validation procedure to select the optimal value of K for the
K-nearest neighbors (KNN) classifier.
k = n_neighbors
n_folds = number of folds
Performs below tasks:
=====================
1> Train a KNN classifier with the selected K using the training data.
2> Predict labels for the test data.
3> Calculate the test error for this KNN model on the test data.
"""
# Define a range of K values to consider
k_values = range(10, k+1) # Consider K values from 10 to k
# Initialize variables to store cross-validation results
cv_scores = []
    # Perform n_folds-fold cross-validation for each K value
    for k in k_values:
        knn_model_cv = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn_model_cv, X_training, Y_training, cv=n_folds, scoring='accuracy')
        cv_scores.append(scores.mean())
# Find the optimal K with the highest cross-validation score
optimal_k = k_values[cv_scores.index(max(cv_scores))]
# Train a KNN classifier with the optimal K using the full training dataset
knn_model_optimal = KNeighborsClassifier(n_neighbors=optimal_k)
knn_model_optimal.fit(X_training, Y_training)
# Predict labels for the test data
y_pred_cv = knn_model_optimal.predict(X_testing)
# Calculate the test error for this KNN model
test_error_cv = zero_one_loss(Y_testing, y_pred_cv)
return test_error_cv
def generate_bayes_error(X_test, Y_test, class_1_centers, class_2_centers ):
"""
mu_k0 = class_1_centers,
mu_k1 = class_2_centers
Performs below tasks:
====================
1> Predict labels for the test data using the Bayes rule.
2> Calculate the test error for this Bayes model on the test data.
"""
mu_k0 = class_1_centers
mu_k1 = class_2_centers
# Bayes prediction
y_pred_bayes = bayes_rule(X_test,mu_k0,mu_k1)
# Calculate the test error for this Bayes model
test_error_bayes = zero_one_loss(Y_test, y_pred_bayes)
return test_error_bayes
def visualize_errors_using_box_plot(test_errors_k7,test_errors_cv, test_errors_bayes):
"""
Generates combined box plot for procedures :
1> KNN with K = 7,
2> (KNN with 10-Fold CV to Select K), and
3> Bayes Rule Error
"""
# the test errors for the three procedures: test_errors_k7, test_errors_cv, test_errors_bayes
# Convert the lists into a DataFrame
df = pd.DataFrame({
'Method': ['KNN with K=7'] * 50 + ['KNN with CV'] * 50 + ['Bayes'] * 50,
'Test Error': test_errors_k7 + test_errors_cv + test_errors_bayes
})
# Create the box plot using Plotly
fig = px.box(df, x='Method', y='Test Error', title='Comparison of Test Errors')
fig.update_traces(marker=dict(size=4)) # Adjust marker size for better visibility
fig.show()
def generate_test_errors(num_iterations,k,n_folds,training_sample_size,testing_sample_size,mu_k0, mu_k1):
# Simulate data generation (Part 1)
# Generate 20 centers and use them to create training/test datasets in the simulation loop.
# Number of iterations (50 datasets)
# num_iterations = 50
# Lists to store test errors for each procedure
test_errors_k7 = []
test_errors_cv = []
test_errors_bayes = []
# Simulation Dataset for Training and Testing
    # training_sample_size = 200   # total training set: 100 * (number of classes = 2) = 200
    # testing_sample_size = 10000  # total test set: 5000 * (number of classes = 2) = 10,000
# Simulation loop
for _ in range(num_iterations):
# Generate training and test datasets (Part 2a)
# Sample 200 points for training and 10,000 points for testing from the 20 centers.
X_training_simulation,Y_training_simulation,X_testing_simulation,Y_testing_simulation = generate_simulation_dataset(mu_k0,mu_k1,training_sample_size,testing_sample_size)
# KNN with K=7 (Part 2b)
test_errors_k7.append(generate_knn_errors(k,X_training_simulation,Y_training_simulation,X_testing_simulation,Y_testing_simulation))
# KNN with CV-selected K (Part 2c)
# Implement 10-fold cross-validation here to select the optimal K.
test_errors_cv.append(generate_knn_cross_validation_errors(kcv_min,n_folds, X_training_simulation,Y_training_simulation,X_testing_simulation,Y_testing_simulation))
# Bayes Rule (Part 2d)
# Implement the Bayes rule based on the known data generation process.
test_errors_bayes.append(generate_bayes_error(X_testing_simulation,Y_testing_simulation,mu_k0,mu_k1))
return test_errors_k7,test_errors_cv, test_errors_bayes
def generate_analysis_results(test_errors_k7,test_errors_cv, test_errors_bayes):
# Assuming you have the test errors for the three procedures: test_errors_k7, test_errors_cv, test_errors_bayes
# Create a DataFrame with procedure names as rows and analysis statistics as columns
data = {
'Procedure': ['KNN with K=7', 'KNN with CV-selected K', 'Bayes Rule'],
'min_test_error': [np.min(test_errors_k7), np.min(test_errors_cv), np.min(test_errors_bayes)],
'max_test_error': [np.max(test_errors_k7), np.max(test_errors_cv), np.max(test_errors_bayes)],
'median_test_error': [np.median(test_errors_k7), np.median(test_errors_cv), np.median(test_errors_bayes)],
'quantiles_test_error': [np.percentile(test_errors_k7, [25, 75]), np.percentile(test_errors_cv, [25, 75]), np.percentile(test_errors_bayes, [25, 75])]
}
# Create the DataFrame
analysis_df = pd.DataFrame(data)
return analysis_df
def perform_simulation(num_iterations,k,n_folds,training_sample_size,testing_sample_size,mu_k0, mu_k1):
"""
Main Method to generate simulation
"""
# Step 1: Generate Test Errors for each procedures i.e. KNN with K = 7, (KNN with 10-Fold CV to Select K)
# and Bayes Rule Error
test_errors_k7,test_errors_cv, test_errors_bayes = generate_test_errors(num_iterations,k,n_folds,training_sample_size,testing_sample_size,mu_k0, mu_k1)
    # Step 2: Generate comparison plots
visualize_errors_using_box_plot(test_errors_k7,test_errors_cv, test_errors_bayes)
# Step 3: Generate Analysis Results:
result = generate_analysis_results(test_errors_k7,test_errors_cv, test_errors_bayes)
return result
result = perform_simulation(50,7,10,200,10000,mu_k0, mu_k1)
result
|   | Procedure | min_test_error | max_test_error | median_test_error | quantiles_test_error [25%, 75%] |
|---|---|---|---|---|---|
| 0 | KNN with K=7 | 0.1710 | 0.2135 | 0.18555 | [0.179225, 0.1916] |
| 1 | KNN with CV-selected K | 0.1631 | 0.1964 | 0.17620 | [0.1714, 0.1810] |
| 2 | Bayes Rule | 0.1576 | 0.1744 | 0.16510 | [0.1634, 0.167925] |
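The assignment also asks for the (min, max, median, 25% quantile, 75% quantile) of the 50 CV-selected K values; the current `generate_knn_cross_validation_errors` only returns the test error, so it would need to also return `optimal_k`. A sketch of just the summary step, using a hypothetical list of selected K values:

```python
import numpy as np

# Hypothetical CV-selected K values, one per simulated dataset (50 total);
# in the real study these would be collected inside the simulation loop
selected_ks = np.array([33, 25, 41, 29, 37, 31, 27, 35, 39, 33] * 5)

summary = {
    'min': int(np.min(selected_ks)),
    'max': int(np.max(selected_ks)),
    'median': float(np.median(selected_ks)),
    '25%': float(np.percentile(selected_ks, 25)),
    '75%': float(np.percentile(selected_ks, 75)),
}
print(summary)
# {'min': 25, 'max': 41, 'median': 33.0, '25%': 29.0, '75%': 37.0}
```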